Contributor

@iverase iverase commented Jun 26, 2025

The current loop is not doing the right thing when there are neighbours. We also decrease SAMPLES_PER_CLUSTER_DEFAULT as it is too high.

@iverase iverase added >non-issue :Search Relevance/Search Catch all for Search Relevance v9.2.0 labels Jun 26, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jun 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

  static final int MAXK = 128;
  static final int MAX_ITERATIONS_DEFAULT = 6;
- static final int SAMPLES_PER_CLUSTER_DEFAULT = 256;
+ static final int SAMPLES_PER_CLUSTER_DEFAULT = 64;
Contributor

@john-wagster john-wagster Jun 26, 2025

Might be worth including some of the runs you were doing in the PR comments, so we can look back at them if we need to confirm recall wasn't hurt by this change.

I'll run a couple runs myself here real quick too to double check with a different model

Contributor

Ran both this whole PR and the sampling change alone on glove 200d (1m, 3m and 10m); neither showed a major drop in recall.
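
For anyone rerunning that comparison later, here is a minimal sketch of a recall@k calculation. It assumes the exact top-k neighbour ids are available as ground truth and that each id list contains no duplicates. The class and method names are hypothetical and are not part of the Elasticsearch code base or of the benchmark used for the runs above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not taken from the PR: measures how many of the true
// top-k neighbours an approximate search returned.
public final class RecallCheck {

    /**
     * Returns the fraction of the first k exact neighbour ids that also appear
     * among the first k approximate neighbour ids. Assumes ids are distinct
     * within each array.
     */
    public static double recallAtK(int[] exactIds, int[] approxIds, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Set<Integer> truth = new HashSet<>();
        for (int i = 0; i < Math.min(k, exactIds.length); i++) {
            truth.add(exactIds[i]);
        }
        if (truth.isEmpty()) {
            return 0.0; // no ground truth to compare against
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, approxIds.length); i++) {
            if (truth.contains(approxIds[i])) {
                hits++;
            }
        }
        return (double) hits / truth.size();
    }
}
```

Averaging this value over a query set before and after the change gives a single number per dataset to compare.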

* @param sampleSize the subset of vectors to use when shifting centroids
* @param maxIterations the max iterations to shift centroids
*/
public static void cluster(FloatVectorValues vectors, float[][] centroids, int sampleSize, int maxIterations) throws IOException {
Contributor

I think this is only used for tests and is kind of silly now. You can just get rid of it, or I can clean it up in a subsequent PR.

Contributor Author

Let's clean up in a follow up PR
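
For context on the `sampleSize` and `maxIterations` parameters of `cluster` discussed in this thread, here is a minimal sketch of Lloyd-style k-means that shifts centroids using only a random sample of the vectors. It is not the Elasticsearch implementation: it works on plain `float[][]` rather than `FloatVectorValues`, the class name and the `seed` parameter are hypothetical, and the `sampleSize` helper assumes (without confirmation from this PR) that the sample is capped at a fixed number of vectors per centroid.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch, not the code under review.
public final class SampledKMeansSketch {

    /** Assumed rule: at most samplesPerCluster vectors per centroid, never more than all vectors. */
    public static int sampleSize(int numVectors, int numCentroids, int samplesPerCluster) {
        return (int) Math.min((long) numVectors, (long) numCentroids * samplesPerCluster);
    }

    /** Shifts centroids in place. Assumes at least one vector and one centroid. */
    public static void cluster(float[][] vectors, float[][] centroids, int sampleSize, int maxIterations, long seed) {
        int n = vectors.length, k = centroids.length, dim = centroids[0].length;
        int m = Math.min(sampleSize, n);
        // Partial Fisher-Yates shuffle: the first m ids form a uniform sample without replacement.
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = i;
        Random rnd = new Random(seed);
        for (int i = 0; i < m; i++) {
            int j = i + rnd.nextInt(n - i);
            int tmp = ids[i]; ids[i] = ids[j]; ids[j] = tmp;
        }
        int[] assignment = new int[m];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            float[][] sums = new float[k][dim];
            int[] counts = new int[k];
            for (int s = 0; s < m; s++) {
                float[] v = vectors[ids[s]];
                int best = nearest(v, centroids);
                if (best != assignment[s]) { assignment[s] = best; changed = true; }
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += v[d];
            }
            if (!changed) break; // assignments are stable, so centroids would not move
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep an empty cluster's centroid where it is
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
    }

    /** Index of the centroid with the smallest squared Euclidean distance to v. */
    static int nearest(float[] v, float[][] centroids) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            float dist = 0f;
            for (int d = 0; d < v.length; d++) {
                float diff = v[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

Under that assumed rule, lowering the per-cluster constant from 256 to 64 shrinks the sample by a factor of four whenever the cap applies, which is why the recall check above matters.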

Contributor

@john-wagster john-wagster left a comment

lgtm

@iverase iverase merged commit ce74df5 into elastic:main Jun 27, 2025
32 checks passed
@iverase iverase deleted the getBestCentroid branch June 27, 2025 11:28
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 3, 2025
… decrease SAMPLES_PER_CLUSTER_DEFAULT (elastic#130069)

* KMeansIntermediate shares assignments